perf: move EG() and CG() in ZTS builds into __thread storage#22231
perf: move EG() and CG() in ZTS builds into __thread storage#22231henderkes wants to merge 4 commits into
Conversation
|
Meant to write comments explaining a few peculiar decisions... but got carried away in the evening. Will get to it and tag relevant reviewers in the morning. |
|
|
||
| static void ts_free_resources(tsrm_tls_entry *thread_resources) | ||
| { | ||
| bool own_thread = thread_resources->thread_id == tsrm_thread_id(); |
There was a problem hiding this comment.
Here it's possible that a thread id was recycled, but the tls points to a now obsolete one
| } | ||
|
|
||
| if (!resource_types_table[i].fast_offset) { | ||
| if (!resource_types_table[i].fast_offset && !resource_types_table[i].tls_addr) { |
There was a problem hiding this comment.
we can't manually free __thread storage
| /* allocates a resource id whose per-thread storage is a native __thread block */ | ||
| TSRM_API ts_rsrc_id ts_allocate_tls_id(ts_rsrc_id *rsrc_id, void *(*tls_addr)(void), size_t size, ts_allocate_ctor ctor, ts_allocate_dtor dtor) | ||
| {/*{{{*/ | ||
| TSRM_ERROR((TSRM_ERROR_LEVEL_CORE, "Obtaining a new TLS resource id, %d bytes", size)); |
There was a problem hiding this comment.
function largely copied from above, looking at it now I see that size_t should be printed as %zu.
| # define TSRM_TLS_MODEL_ATTR | ||
| # define TSRM_TLS_MODEL_DEFAULT | ||
| #elif __PIC__ | ||
| #elif __PIC__ && !defined(__PIE__) |
There was a problem hiding this comment.
a PIE program can use local exec if it's the main executable. Only shared libraries (embed, extensions) need to fall back to initial-exed.
This alone would already be a small speedup (one fewer instruction per access)
| AS_VAR_APPEND([CFLAGS], [" -DZEND_EG_TLS"]) | ||
|
|
||
| dnl -mtls-size=12 drops the dead high-bits offset add from TLS access, | ||
| dnl valid while the thread-local block stays under 4 KiB. |
There was a problem hiding this comment.
This would produce linker errors if tls size exceeded 4kb, but I did a test atatic build with 100 extensions statically compiled in (what a terrible idea) and tls size ended up at 3.7kb. I can't think of a way to test whether a link would succeed before compiling, so this is unconditional.
|
@arnaud-lb I also had a different idea on achieving the same thing by pinning resolved EG and CG to a register on gcc. It worked well, but
Perhaps you have a different idea here? |
There was a problem hiding this comment.
This is a good idea, but you will need to check that it works in various configurations:
- Accessing TLS variables from other modules (shared extensions) used to be broken on some platforms (module wouldn't link or wouldn't load). I don't know if that's still the case, but that's one reason why extensions have their own
_tsrm_ls_cacheinstead of accessing the main one. - local-exec / initial-exec TLS when php is itself a shared library (apache module, embed) is allocated from the TLS surplus which is non portable and limited in size. The size of EG+CG might be larger than surplus. I don't remember why we specify the model explicitly but the loader patches the code to use more efficient models when applicable, so this may not be necessary. (If removing TSRM_TLS_MODEL_ATTR breaks JIT I can help fixing it.)
Register pinning on the hybrid VM would work, but it may increase register pressure. I've tried adding new VM arguments recently for another reason, and it resulted in a performance regression. Global registers would work too, as long as it's a callee-saved register (prevents clobbering by 3rd party libs) (but this increases register pressure everywhere the variable is in scope).
Some idea: EG()/CG() and a few others use ZEND_TSRMG_FAST(), which is _tsrm_ls_cache+some_offset. So EG(foo) is load(_tsrm_ls_cache)+load(some_offset)+offsetof(zend_executor_globals,foo). We could try to make the offset a const expr so that it folds to load(_tsrm_ls_cache)+const:
- Create a struct that contains all fast TLS entries (see php_reserve_tsrm_memory)
- Replace
php_reserve_tsrm_memory()bytsrm_reserve(sizeof(the_struct)). #define EG(v) ZEND_TSRMG_FAST(offsetof(the_struct, eg), zend_executor_globals *, v)
| ZEND_API TSRM_TLS TSRM_TLS_MODEL_ATTR zend_executor_globals executor_globals_tls; | ||
| ZEND_API TSRM_TLS TSRM_TLS_MODEL_ATTR zend_compiler_globals compiler_globals_tls; |
There was a problem hiding this comment.
Possibly these could be embeded in the main _tsrm_ls_cache (TSRMLS_MAIN_CACHE_DEFINE) :
ZEND_API TSRM_TLS TSRM_TLS_MODEL_ATTR struct {
void *cache;
zend_executor_globals eg;
zend_compiler_globals cg;
} _tsrm_ls_cache;
This would simplify JIT changes as we can access eg/cg at an offset from jit->tls.
Also the compiler may generate better code when both EG and CG are used in the same function.
There was a problem hiding this comment.
That's a great idea, I'll give it a spin.
There was a problem hiding this comment.
Looking at it, it introduces tsrm macro changes in many places that assume _tsrm_ls_cache is a void*. Could deref address of .cache to keep callers as void* consumers without pulling in the whole struct.
It simplifies the jit a lot though and leaves the ability to move more things later, so I'll probably do it later.
There was a problem hiding this comment.
I'll need to redo the jit by hand, llm missed too many crucial parts and happily regressed if ie wasn't available (aka macos and Windows).
I tested all built in extensions as shared and the test suites ran. With the latest change I also enabled this for Windows though, so I'll need to get back to the benching and testing board.
Hmm, I haven't tested the apache module, but our simple embed test and frankenphp work fine.
It's permitted to, but not guaranteed to. Clang is pretty good at it, but in GCC aarch64 instruction count goes up 3% by compiling as PIC (without this patch). As for the surplus, this wouldn't change here, as EG/CG are part of the initial load set. Anything that fit in the surplus before will still fit now. The only exception to this would be something like
I did this with the gcc register to get rid of the two dependent loads and the offset and it worked, using fewer instructions than NTS, but it relies on global registers. Toyed around ideas for Clang but they weren't satisfactory. We could of course use the approach without global register variables, but then it only gets rid of an independent offset load while keeping the two dependent loads, while this PR's approach already brings it down to a single load. It would however be worth doing alongside this PR, just for the other tsrm globals, even though they're accessed so rarely I'm not sure it really matters. |
This is the case of the apache module |
So, I was chasing what caused the much higher instruction count on aarch64 for clang vs gcc and while figuring that out, got side tracked to figuring out that ZTS builds also have a ton of extra instructions compared to NTS builds. Turned out it was mostly just from constantly accessing executor and compiler globals when running phpstan. Each access became pseudo assembly:
Here we're slashing one extra load and one add by moving EG and CG onto actual thread storage. Benchmarks are using phpstan on full laravel codebase (full symfony codebase showed the same, but I added the LE optimization after and don't want to rerun 6 hours of benchmarks) and phoronix-test-suite phpbench.
aarch64:
x86_64:
LLM disclosure
claude